A Detailed Proof

A.1 Proof of Theorem 4.1
We can compute the fixed point of the recursion in Equation A.2 and obtain the following estimate. Then we compare these two gaps. To utilize Eq. 4 for policy optimization, we follow the analysis in Section 3.2 of Kumar et al. By choosing different regularizers, a variety of instances within the CQL family can be derived; Eq. B.36 gives the instance called CFCQL(H), which is the update rule we use. To estimate the behavior density, in the discrete action space we train a three-layer MLP network with an MLE loss, and in the continuous action space we use the explicit behavior-density estimation method of Wu et al. A sketch of the discrete-action estimator is given below.
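The discrete-action case reduces to behavior cloning by maximum likelihood: the MLP outputs logits over actions, and minimizing cross-entropy on dataset (state, action) pairs is exactly MLE. A minimal PyTorch sketch follows; the class and function names, layer widths, and training hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BehaviorMLP(nn.Module):
    """Three-layer MLP estimating the behavior density pi_beta(a | s).
    Hidden width and depth are assumptions for illustration."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # logits over discrete actions
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

    def log_prob(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # log pi_beta(a | s), queried wherever the behavior density is needed
        logp = torch.log_softmax(self.net(s), dim=-1)
        return logp.gather(1, a.unsqueeze(-1)).squeeze(-1)

def fit_behavior(model: BehaviorMLP, states: torch.Tensor,
                 actions: torch.Tensor, epochs: int = 100,
                 lr: float = 3e-4) -> BehaviorMLP:
    # MLE: minimize the negative log-likelihood of the dataset actions,
    # which for discrete actions is cross-entropy on the logits.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(states), actions).backward()
        opt.step()
    return model
```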
In this section, we present detailed proofs for the theoretical derivation of Thm. 1, which aims to solve the following optimization problem:

min …
These assumptions are not strong and can be satisfied in most environments, including MuJoCo, Atari games, and so on.

Let $f$ be a Lebesgue-integrable function, and let $P$ and $Q$ be two probability distributions. If $|f| \le C$, then

$$\left|\mathbb{E}_{P(x)}[f(x)] - \mathbb{E}_{Q(x)}[f(x)]\right| \le 2C\,D_{TV}(P,Q). \tag{5}$$

Proof. Suppose there are two actions $a_1, a_2$ under state $s$, and let $Q_1(s,a_1) = u$ and $Q_1(s,a_2) = v$. In this way, we can derive the upper bound of $\mathbb{E}_{a\sim\pi_2}[Q_1(s,a)] - \mathbb{E}_{a\sim\pi_1}[Q_1(s,a)]$ as above. Since both sides of the above equation have the same minimum (here the minima are given by $Q^k = Q$), we can replace the objective in Problem 2 with the upper bound in Eq. (10) and solve the relaxed optimization problem.
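The bound in Eq. (5) follows directly from the definition of total variation distance; a short derivation, assuming the standard convention $D_{TV}(P,Q) = \tfrac{1}{2}\int |p(x) - q(x)|\,dx$, runs:

```latex
\begin{align*}
\left|\mathbb{E}_{P(x)}[f(x)] - \mathbb{E}_{Q(x)}[f(x)]\right|
  &= \left|\int f(x)\bigl(p(x) - q(x)\bigr)\,dx\right| \\
  &\le \int |f(x)|\,\bigl|p(x) - q(x)\bigr|\,dx
      && \text{(triangle inequality)} \\
  &\le C \int \bigl|p(x) - q(x)\bigr|\,dx
      && (|f| \le C) \\
  &= 2C\,D_{TV}(P,Q).
      && \bigl(D_{TV} = \tfrac{1}{2}\,\|p - q\|_{1}\bigr)
\end{align*}
```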